---
title: "Define datasets"
description: "Design, version, and validate datasets used for optimization runs."
---

The optimizer evaluates candidate prompts against datasets stored in Opik. If you are brand new to datasets in Opik, start with [Manage datasets](/evaluation/manage_datasets); this page focuses on tips specific to optimization runs.

Datasets are central to the optimizer SDK: the optimizer runs and scores candidate prompts against each dataset item, and those scores steer it toward better outputs. Without a dataset, the optimizer has no signal for what counts as good or bad.

## Dataset schema

Every item is a JSON object. Required keys depend on your prompt template; optional keys help with analysis. Schemas are optional—define only the fields your prompt or metrics actually consume.

| Field | Purpose |
| --- | --- |
| `inputs` (e.g., `question`, `context`) | Values substituted into your `ChatPrompt` placeholders. |
| `answer` / `label` | Ground truth used by metrics. |
| `metadata` | Arbitrary dict for tagging scenario, split, or difficulty. |
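
For instance, an item for a prompt with a `{question}` placeholder might look like the following (the `scenario` and `difficulty` tags are illustrative):

```json
{
  "question": "How do I create a dataset in Opik?",
  "answer": "Use client.get_or_create_dataset() and dataset.insert().",
  "metadata": { "split": "train", "scenario": "sdk-usage", "difficulty": "easy" }
}
```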

## Create or load datasets

<Steps>
  <Step title="Create via SDK">
    ```python
    import opik

    client = opik.Opik()
    dataset = client.get_or_create_dataset(name="agent-opt-support")
    dataset.insert([
        {"question": "Summarize Opik.", "answer": "Opik is an LLM observability platform."},
        {"question": "List two optimizer types.", "answer": "MetaPrompt and Hierarchical Reflective."},
    ])
    ```
  </Step>
  <Step title="Upload from file">
    - Prepare a CSV or Parquet file with column headers that match your prompt variables.
    - Load the file via Python (e.g., pandas) and call `dataset.insert(...)` or related helpers from the [Dataset SDK](https://www.comet.com/docs/opik/python-sdk-reference/evaluation/Dataset.html).
    - Verify in the UI that rows include `metadata` if you plan to filter by scenario.
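
A minimal sketch of the file-upload path using the stdlib `csv` module (the file name and the commented-out Opik calls are placeholders; pandas works equally well if you prefer it):

```python
import csv

def load_rows(path):
    """Read a CSV with a header row into a list of dicts ready for dataset.insert()."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# rows = load_rows("support_questions.csv")  # columns: question, answer, ...
# client = opik.Opik()
# client.get_or_create_dataset(name="agent-opt-support").insert(rows)
```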
  </Step>
  <Step title="Use built-in samples">
    The optimizer SDK provides ready-made datasets for quick experiments:
    ```python
    from opik_optimizer import datasets
    hotpot = datasets.hotpot(count=300)
    tiny = datasets.tiny_test()
    ```
    These datasets live in `sdks/opik_optimizer/src/opik_optimizer/datasets` and mirror the notebook examples.
  </Step>
</Steps>

## Train/validation splits

**Overfitting** occurs when an optimized prompt performs well on the examples it was trained on but fails to generalize to new, unseen data. To prevent this, split your dataset into separate sets:

- **Training dataset** (70-80%): Used by the optimizer to generate prompt improvements
- **Validation dataset** (20-30%): Used to evaluate and rank candidate prompts during optimization, helping select prompts that generalize well
- **Test dataset** (optional, separate): Held out completely until after optimization to measure final real-world performance

The optimizer uses the training set for learning and the validation set for selection, ensuring the best prompt works beyond the training examples.

```python
import opik

client = opik.Opik()

# Create training dataset (70-80% of your data)
training_dataset = client.get_or_create_dataset(name="agent-opt-train")
training_dataset.insert([
    {"question": "What is Opik?", "answer": "Opik is an LLM observability platform."},
    {"question": "List optimizer types.", "answer": "MetaPrompt, Evolutionary, etc."},
    # ... more training examples
])

# Create validation dataset (20-30% of your data)
validation_dataset = client.get_or_create_dataset(name="agent-opt-val")
validation_dataset.insert([
    {"question": "Explain Opik's purpose.", "answer": "Opik helps monitor LLMs."},
    {"question": "Name two optimizers.", "answer": "GEPA and Few-Shot Bayesian."},
    # ... more validation examples
])

# Use both during optimization
result = optimizer.optimize_prompt(
    prompt=my_prompt,
    dataset=training_dataset,
    validation_dataset=validation_dataset,
    metric=my_metric,
)
```

**Split recommendations:**
- **70/30 or 80/20** is standard for training/validation splits
- **Ensure diversity** in both sets to cover different scenarios
- **Keep validation data unseen** during prompt development
- **Use the same distribution** in both sets to ensure valid evaluation
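
The split itself can be as simple as a seeded shuffle over your items; this sketch uses plain Python (the `split_items` helper is illustrative, not part of the SDK):

```python
import random

def split_items(items, train_frac=0.8, seed=42):
    """Shuffle once with a fixed seed, then split into train and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

items = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
train, val = split_items(items)
# With 10 items and train_frac=0.8: 8 training items, 2 validation items.
# Insert each list into its own Opik dataset, as shown earlier on this page.
```

Fixing the seed keeps the split reproducible across runs, so repeated optimizations compare like with like.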

### Testing on held-out data

After optimization completes, evaluate the final prompt on a completely held-out test dataset to confirm it generalizes to production scenarios:

```python
from opik.evaluation import evaluate_prompt

# After optimization, test on unseen data
test_dataset = client.get_dataset(name="agent-opt-test")

test_results = evaluate_prompt(
    prompt=result.prompt,  # Best prompt from optimization
    dataset=test_dataset,
    scoring_metrics=[my_metric],
    task_threads=4,
)

print(f"Test score: {test_results.mean_scores}")
```

This final test score gives you confidence that improvements will transfer to real-world usage.

## Best practices

- **Keep datasets immutable** during an optimization run; create a new dataset version if you need to add rows.
- **Use validation datasets** to avoid overfitting—split your data 70/30 or 80/20 between training and validation sets.
- **Log context** fields if you run RAG-style prompts so failure analyses can surface missing passages.
- **Track splits via metadata** (e.g., `metadata["split"] = "eval"`) for additional organization beyond separate datasets.
- **Document ownership** using dataset descriptions so teams know who curates each collection.
- **Keep schema + prompt in sync** – if your prompt expects `{context}`, ensure every dataset row defines that key or provide defaults in the optimizer.
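
For example, split tagging can be done with a small helper before inserting rows (`tag_split` is a hypothetical convenience, not an SDK function):

```python
def tag_split(items, split_name):
    """Return copies of items with metadata['split'] set; the originals are untouched."""
    tagged = []
    for item in items:
        meta = dict(item.get("metadata", {}))  # preserve any existing metadata
        meta["split"] = split_name
        tagged.append({**item, "metadata": meta})
    return tagged

eval_items = tag_split(
    [{"question": "Name two optimizers.", "answer": "GEPA and Few-Shot Bayesian."}],
    "eval",
)
# Each item now carries metadata == {"split": "eval"}, ready for dataset.insert().
```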

## Validation checklist

- Confirm row counts in the Opik **Datasets** tab (or by running `len(dataset.get_items())` in Python) before and after uploads.
- Spot-check rows in the dashboard’s Dataset viewer.
- If rows include multimodal assets or tool payloads, confirm they appear in the trace tree once you run an optimization.
- Run an initial small-batch optimization with a few rows of data to validate everything end to end.
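
A quick pre-flight check along these lines can catch missing keys before you launch a run (`check_required_keys` is illustrative; in practice, fetch real rows with `dataset.get_items()`):

```python
def check_required_keys(items, required):
    """Return indexes of items missing any required key (empty list means all rows are complete)."""
    return [i for i, item in enumerate(items) if not required.issubset(item)]

# items = dataset.get_items()  # fetch rows from Opik before an optimization run
items = [{"question": "q1", "answer": "a1"}, {"question": "q2"}]
missing = check_required_keys(items, {"question", "answer"})
# missing -> [1]: the second row lacks 'answer'
```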

## Next steps

Define how you will score results with [Define metrics](/agent_optimization/optimization/define_metrics), then follow [Optimize prompts](/agent_optimization/optimization/optimize_prompts) to launch experiments. For domain-specific scoring, extend the dataset with extra fields and reference them inside [Custom metrics](/agent_optimization/advanced/custom_metrics).
